Introduction

This a template for an analysis notebook using RMarkdown.

In this notebook, we will set up parameters in the seurat workflow [normalization –> reduction –> clustering] for one Wilms tumor sample (SCPCS000168) of the Wilms Tumor dataset (SCPCP000006).

This correspond to the step 2 of the proposed analysis: clustering of cells across a set of parameters for few samples

Packages

Load required packages in the following chunk, if needed. Do not install packages here; only load them with the library() function.

library(SingleCellExperiment)
library(Seurat)      # to use Seurat workflow

library(tidyr)
library(dplyr)
library(DT)          # for table visualization

library(ggplot2)     # for visualization
library(patchwork)   # for visualization
library(ggplotify)
library(SCpubr)      # for visualization
library(viridis)     # for visualization - colors

library(edgeR)       # For pseudobulk based differential expression (DE) analysis
library(DElegate)    # for pseudobulk based DE analysis and identification of marker genes

library(Azimuth)     # For label transfer

library(msigdbr)     # for gene set enrichment analysis or pathway enrichment
library(clusterProfiler)

library(data.table)

Some config set-up / threshold

We store in a config list names “cfg” parameters used all along the analysis to filter for p-value, log fold change, percentage of expression, etc.

cfg = list()
cfg$padj_thershold = 0.05
cfg$lfc_threshold = 1
cfg$rate1_threshold = 0.5
set.seed(12345)

Base directories

# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "2024-07-08", "SCPCP000006")

# The path to this module
module_base <- file.path(repository_base, "analyses", "cell-type-wilms-tumor-06")

Input and output files

For this analysis, we worked with the _processed.rds data. We builded a Seurat object based on the counts data and re-perform the analysis [normalization –> reduction –> clustering] following the Seurat workflow.

We transferred meta.data to keep:

[x] QC data computed by the DataLab

[x] annotation data computed by the DataLab

[x] raw annotation and gene_symbol conversion

# input
# use download-data.py to download the data

# get all files in the data directory
filelist <- list.files(data_dir, full.names = TRUE)

# select the 40 Wilms tumor single nucleus folders only
filelist <- filelist[! grepl(".tsv", filelist)]

# output
## html report for each of the Wilms tumor sample will be saved in 
path_to_report <- file.path(module_base, "01-clustering")
## some plots from the report will be saved in 
path_to_plot <- file.path(module_base, "/plots")
## after final decision on the clustering strategy, processed RDS files will be saved in 
path_to_result <- file.path(module_base, "/results")
## to be modified when rendered
i = 1
filedir <- filelist[i] 
sample <- gsub(paste0(data_dir, "/"), "", filedir)

metadata <- read.table(file.path(data_dir, "/single_cell_metadata.tsv"), sep = "\t", header=TRUE)

metadata[metadata$scpca_sample_id == sample,]
scpca_project_id scpca_sample_id scpca_library_id diagnosis subdiagnosis disease_timing age_at_diagnosis sex tissue_location participant_id submitter submitter_id organism development_stage_ontology_term_id sex_ontology_term_id organism_ontology_id self_reported_ethnicity_ontology_term_id disease_ontology_term_id tissue_ontology_term_id metastasis relapse_status treatment vital_status seq_unit technology total_reads mapped_reads sample_cell_count_estimate unfiltered_cells filtered_cell_count processed_cells has_cellhash includes_anndata is_cell_line is_multiplexed is_xenograft pi_name project_title genome_assembly mapping_index alevin_fry_version salmon_version transcript_type droplet_filtering_method cell_filtering_method prob_compromised_cutoff min_gene_cutoff normalization_method date_processed workflow workflow_version workflow_commit
SCPCP000006 SCPCS000168 SCPCL000205 Wilms tumor Anaplastic Initial diagnosis 2.08 M Kidney SJWLM059659 murphy_chen SJWLM059659_D1 Homo sapiens HsapDv:0000096 PATO:0000384 NCBITaxon:9606 unknown MONDO:0006058 UBERON:0002113 NA No Upfront resection Alive nucleus 10Xv3.1 171527928 122977092 23741 72890 23741 22476 False True False False False murphy_chen Single nuclear RNA-seq and spatial transcriptomic analysis of anaplastic and favorable histology Wilms tumor Homo_sapiens.GRCh38.104 Homo_sapiens.GRCh38.104.spliced_intron.txome 0.7.0 1.5.2 [total, spliced] emptyDropsCellRanger miQC 0.75 200 deconvolution 2024-03-18T20:08:06+0000 https://github.com/AlexsLemonade/scpca-nf v0.8.0 8bef82d853d19e5aeddd75401aa54cf8bfbced13

Here we have a sample of anaplastic histology, untreated. Few thoughts:

  • We thus expect anaplasia-associated patterns (TP53, MYCN as an example) to be major driver here.

  • We do not expect high immune infiltration (induced by chemotherapy).

  • We do not expect chemotherapy induced DNA damages, but maybe some DNA-repair pathways up in anaplstic cells.

Description of the analysis workflow

We perform the following analysis to assess for the quality of clustering:

[1] We perform some quality check to assess any QC-induced clustering (nFeature, nCount, percent.mito).

[2] We add cell cycle information, as we know that in a specific cell cycle state, the transcriptional program is mostly/exclusively related to cell cycle genes and the identity of cells is difficult to determine. We expect these cells to cluster together in a cluster of proliferating cells.

[3] We look at specific marker genes that we reported in the table marker.sets/CellType_metadata.csv to check the relevance of the clustering.

[4] We look at specific pathways that we reported in the table marker.sets/Pathways_metadata.csv to check the relevance of the clustering.

[5] We run DElegate::FindAllMarkers2 to find markers of the different clusters and manually check if they do make sense. DElegate::FindAllMarkers2 is an improved version of Seurat::FindAllMarkers based on pseudobulk differential expression method.

[6] We perform enrichment analysis of marker genes for each seurat clusters. We defined all the genes from the seurat object as the universe and used the MSigDB gene sets.

[7] We plot pca/umap reduction grouping with available annotations (singler_, cellassign_). We expect at least immune cells to be correctly label and fall into a few set of clusters.

[8] We run label transfer (Azimuth) to transfer annotation from the fetal kidney atlas human reference. We plot pca/umap reduction grouping with latest labels. We expect it to be the most representative of the cell types in the sample.

Selected parameters

Please note: to keep the notebook as straight as possible, we decided to show the analysis for the selected set of parameters:

  • SCTransform (default parameters)
  • RunPCA (default)
  • RunUMAP (dims = 1:50)
  • FindNeighbors (dims = 1:50)
  • FindClusters (default)

Note: Other parameters have been previously tested, but we would like to show in the following report that the one selected is performing good.

Analysis content

Load the processed SingleCellExperiment rds object

file <- list.files(filedir)
file <- file[grepl("_processed.rds", file)]

# open the processed rds object
sce <- readRDS(paste0(filedir, "/", file))

Run Seurat pipeline

# convert to seurat
srat <- CreateSeuratObject(counts = counts(sce),
                                    assay = "RNA",
                                    project = sample
                           )

# convert colData and rowData to data.frame for use in the Seurat object
cell_metadata <- as.data.frame(colData(sce))
row_metadata <- as.data.frame(rowData(sce))

# add cell metadata (colData) from SingleCellExperiment to Seurat
srat@meta.data <- cell_metadata

# add row metadata (rowData) from SingleCellExperiment to Seurat
srat[["RNA"]]@meta.data <- row_metadata

# add metadata from SingleCellExperiment to Seurat
srat@misc <- metadata(sce)


# Normalization
options(future.globals.maxSize= 891289600)
srat <- SCTransform(srat, verbose = F, conserve.memory = TRUE)

# dimensionality reduction
srat <- RunPCA(srat, verbose = F)
srat <- RunUMAP(srat, dims = 1:50, verbose = F)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
# clustering
srat    <- FindNeighbors(srat, dims = 1:50, verbose = F)
srat    <- FindClusters(srat, verbose = T)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 22476
## Number of edges: 659691
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8197
## Number of communities: 14
## Elapsed time: 3 seconds

Visualize seurat clusters

d2 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE) + ggtitle("Seurat Cluster - umap")
d1 <- SCpubr::do_DimPlot(srat, reduction="pca", group.by = "seurat_clusters", label = TRUE) + ggtitle("Seurat Cluster - pca")
v <- SCpubr::do_ViolinPlot(srat, features = c( "subsets_mito_percent"), ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(2,2,4))

Cell cycle information

s.genes <- srat@assays$RNA@meta.data$gene_ids[srat@assays$RNA@meta.data$gene_symbol %in% cc.genes$s.genes]
g2m.genes <- srat@assays$RNA@meta.data$gene_ids[srat@assays$RNA@meta.data$gene_symbol %in% cc.genes$g2m.genes]

srat <- CellCycleScoring(srat, s.features = s.genes, g2m.features = g2m.genes, set.ident = FALSE)
d2 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "Phase", label = TRUE) + ggtitle("Phase - umap")
d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE) + ggtitle("Seurat Cluster - umap")

b <- SCpubr::do_BarPlot(sample = srat,
                         group.by = "Phase",
                         split.by = "seurat_clusters",
                         position = "fill",
                         font.size = 10,
                         legend.ncol = 3) +
                         ggtitle("% cells")+
                         xlab(file)
d1 + d2 + b + plot_layout(ncol = 3, widths = c(2,2,4))

 SCpubr::do_ViolinPlot(srat, features = c("S.Score", "G2M.Score"), ncol = 2, group.by = "seurat_clusters", legend.position = "none") 

Look at specific genes

Here, we open the table of marker genes marker-sets/CellType_metadata.csv.

  # open the set of reference marker genes

CellType_metadata <- read.table("marker-sets/CellType_metadata.csv", header = TRUE, sep = ",")
    
DT::datatable(CellType_metadata, caption = ("CellType_metadata"), 
              extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',

                               buttons = c( 'csv', 'excel')))

Malignant markers

 SCpubr::do_ViolinPlot(srat, features = rownames(srat)[rownames(srat) %in% (CellType_metadata$ENSEMBL_ID[CellType_metadata$cell_class %in% c("malignant")])] , ncol = 9, group.by = "seurat_clusters", legend.position = "none") 

All clustersexcept C13 are IGF2+ (ENSG00000167244).

d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "ENSG00000167244", pt.size = 0.2) + ggtitle("IGF2")
v <- SCpubr::do_ViolinPlot(srat, features = "ENSG00000167244", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

The set of clusters 0-1-4-8-9-11-12 is NCAM1+ (ENSG00000149294), which is a blastemal marker.

d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "ENSG00000149294",  pt.size = 0.2) + ggtitle("NCAM1")
v <- SCpubr::do_ViolinPlot(srat, features = "ENSG00000149294", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

All clusters except of C13 are WT1+ (ENSG00000184937).

d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "ENSG00000184937", pt.size = 0.2) + ggtitle("WT1")
v <- SCpubr::do_ViolinPlot(srat, features = "ENSG00000184937", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

C10 is positive for COL6A3 which is a marker of stroma Wilms tumor cells (also normal stroma)

d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "ENSG00000163359", pt.size = 0.2) + ggtitle("COL6A3")
v <- SCpubr::do_ViolinPlot(srat, features = "ENSG00000163359", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

Immune markers

 SCpubr::do_ViolinPlot(srat, features = rownames(srat)[rownames(srat) %in% (CellType_metadata$ENSEMBL_ID[CellType_metadata$cell_class %in% c("immune")])]
                       , ncol = 6, group.by = "seurat_clusters", legend.position = "none") 

C13 is PTPRC+ (ENSG00000081237) and must be a cluster of immune cells. Wilms tumor are generally cold tumor, especially untreated Wilms tumor. According to the clinical data, the present sample is untreated (treatment = upfront resection). We do not expect high percentage of immune cells. The size of the cluster would therefore make sense.

d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "ENSG00000081237", pt.size = 0.2) + ggtitle("PTPRC = CD45")
v <- SCpubr::do_ViolinPlot(srat, features = "ENSG00000081237", ncol = 1, group.by = "seurat_clusters", legend.position = "none", pt.size = 2) 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

Other markers

 SCpubr::do_ViolinPlot(srat, features = rownames(srat)[rownames(srat) %in% (CellType_metadata$ENSEMBL_ID[! CellType_metadata$cell_class %in% c("immune", "malignant")])]
                       , ncol = 6, group.by = "seurat_clusters", legend.position = "none") 

The set of clusters 8-11 is EPCAM+ (ENSG00000119888), which is a marker of epithelial (malignant and normal) kidney cells.

d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "ENSG00000119888", pt.size = 0.2) + ggtitle("EPCAM")
v <- SCpubr::do_ViolinPlot(srat, features = "ENSG00000119888", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

Look at specific pathways

Here, we open the table of marker genes marker-sets/Pathway_metadata.csv.

# open the set of reference marker genes

Pathway_metadata <- read.csv("marker-sets/Pathway_metadata.csv", header = TRUE, sep = ",")
    
DT::datatable(Pathway_metadata, caption = ("CellType_metadata"), 
              extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',

                               buttons = c( 'csv', 'excel')))

TP53 pathway

here we will calculate a TP53 score using AddMduleScore and the genes of the HALLMARK_P53_PATHWAY gene set.

## define genesets
hallmarks <- msigdbr(species = "human", category = "H")

TP53_list = hallmarks %>% 
  filter(gs_name == "HALLMARK_P53_PATHWAY") %>%
  dplyr::distinct(gs_name, gene_symbol, human_ensembl_gene) %>% as.data.frame()

TP53_list <- list(TP53_list$human_ensembl_gene)

srat <- AddModuleScore(srat, features = TP53_list, name = "TP53_score")
## Warning: The following features are not present in the object: ENSG00000179593,
## ENSG00000147883, ENSG00000245848, ENSG00000137975, ENSG00000284841,
## ENSG00000276536, ENSG00000206377, ENSG00000206478, ENSG00000227231,
## ENSG00000230128, ENSG00000235030, ENSG00000237155, ENSG00000177551,
## ENSG00000175793, ENSG00000206297, ENSG00000224212, ENSG00000224748,
## ENSG00000226173, ENSG00000227816, ENSG00000230705, ENSG00000232367, not
## searching for symbol synonyms
d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "TP53_score1", pt.size = 0.2) + ggtitle("TP53 score")
v <- SCpubr::do_ViolinPlot(srat, features = "TP53_score1", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

“Normal” cells (immune, C13) have a slightly higher TP53 score.

DNA repair pathway

here we will calculate a DNA_repair score using AddMduleScore and the genes of the HALLMARK_DNA_REPAIR gene set.

DNA_repair = hallmarks %>% 
  filter(gs_name == "HALLMARK_DNA_REPAIR") %>%
  dplyr::distinct(gs_name, gene_symbol, human_ensembl_gene) %>% as.data.frame()

DNA_repair_list <- list(DNA_repair$human_ensembl_gene)

srat <- AddModuleScore(srat, features = DNA_repair_list, name = "DNA_repair_score")
## Warning: The following features are not present in the object: ENSG00000152669,
## ENSG00000284752, ENSG00000288114, ENSG00000206268, ENSG00000206357,
## ENSG00000229363, ENSG00000231044, ENSG00000233801, ENSG00000180099,
## ENSG00000206502, ENSG00000224859, ENSG00000233795, ENSG00000235176,
## ENSG00000235443, ENSG00000236808, ENSG00000236949, ENSG00000284832,
## ENSG00000280627, ENSG00000276463, ENSG00000285339, ENSG00000274352, not
## searching for symbol synonyms
d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "DNA_repair_score1", pt.size = 0.2) + ggtitle("DNA_repair score")
v <- SCpubr::do_ViolinPlot(srat, features = "DNA_repair_score1", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

I am not sure how relevant the DNA repair score is. To be tested on more samples.

DROSHA target genes

## define genesets
c3 <- msigdbr(species = "human", category = "C3", subcategory = "TFT:GTRD")

DROSHA_list = c3 %>% 
  filter(gs_name == "DROSHA_TARGET_GENES") %>%
  dplyr::distinct(gs_name, gene_symbol, human_ensembl_gene) %>% as.data.frame()

DROSHA_list <- list(DROSHA_list$human_ensembl_gene)

srat <- AddModuleScore(srat, features = DROSHA_list, name = "DROSHA_score")
## Warning: The following features are not present in the object: ENSG00000265134,
## ENSG00000210164, ENSG00000210100, ENSG00000210156, ENSG00000209082,
## ENSG00000210112, ENSG00000210107, ENSG00000210151, ENSG00000210077,
## ENSG00000248923, ENSG00000274978, ENSG00000199568, ENSG00000207205,
## ENSG00000275538, not searching for symbol synonyms
d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "DROSHA_score1", pt.size = 0.2) + ggtitle("DROSHA score")
v <- SCpubr::do_ViolinPlot(srat, features = "DROSHA_score1", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

DICER1 target genes

DICER1_list = c3 %>% 
  filter(gs_name == "DICER1_TARGET_GENES") %>%
  dplyr::distinct(gs_name, gene_symbol, human_ensembl_gene) %>% as.data.frame()

DICER1_list <- list(DICER1_list$human_ensembl_gene)

srat <- AddModuleScore(srat, features = DICER1_list, name = "DICER1_score")
## Warning: The following features are not present in the object: ENSG00000224157,
## ENSG00000210049, ENSG00000210100, ENSG00000210156, ENSG00000210196,
## ENSG00000210151, ENSG00000210195, ENSG00000261441, ENSG00000222328,
## ENSG00000199347, not searching for symbol synonyms
d1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
d2 <- SCpubr::do_FeaturePlot(srat, features = "DICER1_score1", pt.size = 0.2) + ggtitle("DICER1 score")
v <- SCpubr::do_ViolinPlot(srat, features = "DICER1_score1", ncol = 1, group.by = "seurat_clusters", legend.position = "none") 

d1 + d2 + v + plot_layout(ncol = 3, widths = c(1,1,3))

Not sure what to do with these scores.

Find marker genes for each of the seurat clusters

In addition to the list of known marker genes, we used an unbiased approach to find transcripts that characterized the different clusters. We run DElegate::FindAllMarkers2 to find markers of the different clusters and manually check if they do make sense. DElegate::FindAllMarkers2 is an improved version of Seurat::FindAllMarkers based on pseudobulk differential expression method. Please check the preprint from Chistoph Hafemeister: https://www.biorxiv.org/content/10.1101/2023.03.28.534443v1 and tool described here: https://github.com/cancerbits/DElegate

feature_conversion <- srat@assays$RNA@meta.data
de_results   <- DElegate::FindAllMarkers2(srat, group_column = "seurat_clusters")

#filter the most relevant markers
s.markers <- de_results[de_results$padj < cfg$padj_thershold & de_results$log_fc > cfg$lfc_threshold & de_results$rate1 > cfg$rate1_threshold,]

# add gene symbol for easiest interpretation of the result
s.markers$gene_ids <- s.markers$feature
s.markers <- left_join(s.markers,feature_conversion, by = c( "gene_ids") )
identical(s.markers$feature, s.markers$gene_ids)     # check the quality of the merge, must be true
## [1] TRUE
DT::datatable(s.markers, caption = ("marker genes"), 
              extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',

                               buttons = c( 'csv', 'excel')))
# Select top 5 genes for heatmap plotting
s.markers <- na.omit(s.markers)
s.markers %>%
    group_by(group1) %>%
    top_n(n =  5, wt = log_fc) -> top5

# subset for plotting
cells <- WhichCells(srat, downsample = 100)
ss <- subset(srat, cells = cells)
ss <- ScaleData(ss, features = top5$feature)

p1 <- SCpubr::do_DimPlot(srat, reduction="umap", group.by = "seurat_clusters", label = TRUE, repel = TRUE) + ggtitle("Seurat Cluster - umap")
p2 <- DoHeatmap(ss, features = top5$feature,  cells = cells, group.by = "seurat_clusters") + NoLegend() + 
  scale_fill_gradientn(colors =  c("#01665e","#35978f",'darkslategray3', "#f7f7f7", "#fee391","#fec44f","#F9AD03")) 
p3 <- ggplot(srat@meta.data, aes(seurat_clusters, fill = seurat_clusters)) + geom_bar() + NoLegend()


common_title <- sprintf("Unsupervised clustering %s, %d cells", srat@meta.data$orig.ident[1], ncol(srat))
show((((p1 / p3) + plot_layout(heights = c(3,2)) | p2) ) + plot_layout(widths = c(1, 2)) + plot_layout(heights = c(3,1)) + plot_annotation(title = common_title))

DT::datatable(top5[, c(1, 9, 11, 12)], caption = ("top 5 marker genes"), 
              extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',

                               buttons = c( 'csv', 'excel')))

The cluster that we found seems to make sense as they can be defined by a few number of specific genes each. I don’t know most of the marker genes found, but few point below that could support the analysis and annotation.

PTPRC (C13, CD45), define the immune cluster.

Enrichment analysis of marker genes

Here we perform enrichment analysis of the marker genes found in the previous section for each Seurat cluster.

We defined as universe/background all the genes expressed in the dataset, meaning the rownames of the Seurat object.

We used three gene sets from MSiGDB:

  • the hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
  • the C3 : regulatory target gene sets based on gene target predictions for microRNA seed sequences and predicted transcription factor binding sites.
  • the C8 : cell type signature gene sets curated from cluster markers identified in single-cell sequencing studies of human tissue.

We used enricher function from clusterProfiler to perform enrichment analysis.

# define background genes = universe for enrichment
background <-  row_metadata$gene_ids

## define genesets
c8 <- msigdbr(species = "human", category = "C8")

msigdbr_hallmarks = hallmarks %>% dplyr::distinct(gs_name, ensembl_gene) %>% as.data.frame()
msigdbr_c3 = c3 %>% dplyr::distinct(gs_name, ensembl_gene) %>% as.data.frame()
msigdbr_c8 = c8 %>% dplyr::distinct(gs_name, ensembl_gene) %>% as.data.frame()

Hallmarks MSiGDB gene sets

tmp_H <- NULL
for(i in unique(srat$seurat_clusters)){
signature = s.markers$feature[s.markers$group1 == i]
ego_module <- enricher(gene = signature, universe = background, TERM2GENE = msigdbr_hallmarks)
tmp <- NULL

  if(!is.null(ego_module)){
    if(dim(ego_module)[1]>0){
    b_H <- barplot(ego_module, showCategory = 15, font.size = 20, label_format = 40)  
    tmp <- b_H$data
    tmp$set <- "Hallmarks"
    tmp$cluster <- i
    }
  }
tmp_H <- rbind(tmp_H, tmp)
}



ggplot(tmp_H, aes(Count, ID)) +
    geom_bar(stat = "identity", aes(fill=cluster))+

  theme_classic()+ 
    geom_text(
        aes(label = (paste( "Padj = ", round(p.adjust,2)))),
        color = "black",
        size = 3,
        hjust=1,
        position = position_dodge(0.5)
    )+
   theme(text = element_text(size=14))+
  facet_wrap(facets=c("cluster"), ncol = length(unique(tmp_H$cluster)))

MYC enrichment is expected in anaplastic samples.

C3 MSiGDB regulatory target gene sets

tmp_H <- NULL
for(i in unique(srat$seurat_clusters)){
signature = s.markers$feature[s.markers$group1 == i]
ego_module <- enricher(gene = signature, universe = background, TERM2GENE = msigdbr_c3)
tmp <- NULL

  if(!is.null(ego_module)){
    if(dim(ego_module)[1]>0){
    b_H <- barplot(ego_module, showCategory = 15, font.size = 20, label_format = 40)  
    tmp <- b_H$data
    tmp$ID <- factor(tmp$ID, levels = c(tmp$ID[order(tmp$Count)]))
    tmp$set <- "c8"
    tmp$cluster <- i
    }
  }
tmp_H <- rbind(tmp_H, tmp[1:5,])
}



ggplot(tmp_H, aes(Count, ID)) +
    geom_bar(stat = "identity", aes(fill=cluster))+

  theme_classic()+ 
    geom_text(
        aes(label = (paste( "Padj = ", round(p.adjust,2)))),
        color = "black",
        size = 3,
        hjust=1,
        position = position_dodge(0.5)
    )+
   theme(text = element_text(size=14))+
  facet_wrap(facets=c("cluster"), ncol = length(unique(tmp_H$cluster)))
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_text()`).

C8 MSiGDB cell type signature gene sets

tmp_H <- NULL
for(i in unique(srat$seurat_clusters)){
signature = s.markers$feature[s.markers$group1 == i]
ego_module <- enricher(gene = signature, universe = background, TERM2GENE = msigdbr_c8)
tmp <- NULL

  if(!is.null(ego_module)){
    if(dim(ego_module)[1]>0){
    b_H <- barplot(ego_module, showCategory = 15, font.size = 20, label_format = 40)  
    tmp <- b_H$data
    tmp$ID <- factor(tmp$ID, levels = c(tmp$ID[order(tmp$Count)]))
    tmp$set <- "c8"
    tmp$cluster <- i
    }
  }
tmp_H <- rbind(tmp_H, tmp[1:5,])
}



ggplot(tmp_H, aes(Count, ID)) +
    geom_bar(stat = "identity", aes(fill=cluster))+

  theme_classic()+ 
    geom_text(
        aes(label = (paste( "Padj = ", round(p.adjust,2)))),
        color = "black",
        size = 3,
        hjust=1,
        position = position_dodge(0.5)
    )+
   theme(text = element_text(size=14))+
  facet_wrap(facets=c("cluster"), ncol = length(unique(tmp_H$cluster)))

  • C10 is enriched in adulte differentiated cell types/pathway of the kidney. This could suggest that C10 are from normal kidney. To be checked with inferCNV profile.

  • C2, C5are enriched in muscle pathways, suggesting a stromal origin.

Annotation from SingleR (no cancer or kidney specific dataset)

Here, we quickly checked annotations that are present in the _processed rds object. However, the automated annotation have not been performed using a cancer specific reference or a kidney reference. We do not expect a nice labelling of the cells as the overlap of cell types between the reference and the query dataset is poor. This support the need to do a proper label transfer from the fetal kidney atlas, which is imho the best reference that can be applied to a Wilms tumor query.

d2 <-DimPlot(srat, group.by = "singler_celltype_annotation", reduction = "umap", label = TRUE, repel = TRUE) + NoLegend()



DT::datatable(table(srat$seurat_clusters, srat$singler_celltype_annotation), caption = ("table of SingleR annotation per seurat clusters"), 
              extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',

                               buttons = c( 'csv', 'excel')))
d3 <- SCpubr::do_BarPlot(sample = srat,
                         group.by = "singler_celltype_annotation",
                         split.by = "seurat_clusters",
                         position = "fill",
                         font.size = 10,
                         legend.ncol = 3) +
                         ggtitle("% cells")+
                         xlab(file)

d2|d3
## Warning: ggrepel: 12 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Annotation from cellassign (no cancer or kidney specific dataset)

d2 <-DimPlot(srat, group.by = "cellassign_celltype_annotation", reduction = "umap", label = TRUE, repel = TRUE) + NoLegend()


DT::datatable(table(srat$seurat_clusters, srat$cellassign_celltype_annotation)
, caption = ("table of cellassign annotation per seurat clusters"), 
              extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',

                               buttons = c( 'csv', 'excel')))
d3 <- SCpubr::do_BarPlot(sample = srat,
                         group.by = "cellassign_celltype_annotation",
                         split.by = "seurat_clusters",
                         position = "fill",
                         font.size = 10,
                         legend.ncol = 3) +
                         ggtitle("% cells")+
                         xlab(file)

d2|d3
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Even if the result of SingleR and CellAssign are not specific to a Wilms tumor dataset, it give the first impression that C13 a cluster of immune cells.

Azimuth annotation from fetal kidney

Note to the DataLab : where should I save the fetal kidney reference? In S3 like results as it is a quite big object or in the folder marker-sets?

For more information related to the reference, please go to https://www.kidneycellatlas.org/ You will find:

  • interactive viewer

  • h5ad files to download.

Please note that as Wilms tumor have been described to be closer to fetal kidney as mature kidney, we only used the fetal kidney atlas as the reference. Also check : https://www.science.org/doi/10.1126/science.aat5031

This part is imho one of the most important step that allow us to have a quick and reliable idea of the composition of the different clusters. The predicted compartment are defined into 4 categories:

  • endotheliúm
  • stroma
  • fetal nephron
  • immune

As for SingleR and CellAssign, the annotation of immune cells and endothelial cells is straightforward. The stroma compartment should then contain normal and cancer stromal cells. The fetal nephron compartment contain blastema cancer cells a well as normal and cancer epithelial cells.

Further segregation of cancer versus normal cells will be achieved using a combination of markers/pathways (see above) and inferred CNV (to be done).

# access to the azimuth kidney reference 
##### to be updated depending on the location of the reference file 
ref_dir = file.path(module_base, "marker-sets/Azimuth_Compatible_Fetal_full/")

reference <- LoadReference(ref_dir)
## Warning: Overwriting miscellanous data for model
## Warning: Adding a dimensional reduction (refUMAP) without the associated assay
## being present
## Warning: Adding a dimensional reduction (refUMAP) without the associated assay
## being present
# Load the query object for mapping

# Preprocess with SCTransform
# Note, the RunAzimuth change a bit the Seurat object, changing for example the rownames to Symbol. 
# We thus choose to rename the object to keep the 
data <-  RunAzimuth(srat, ref_dir, assay = 'RNA')
## Warning: Overwriting miscellanous data for model
## Warning: Adding a dimensional reduction (refUMAP) without the associated assay
## being present
## Warning: Adding a dimensional reduction (refUMAP) without the associated assay
## being present
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
## Warning: No layers found matching search pattern provided
## Warning: 942 features of the features specified were not present in both the reference query assays. 
## Continuing with remaining 2058 features.
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from integrated_dr_ to integrateddr_
## Warning in RunUMAP.default(object = neighborlist, reduction.model =
## reduction.model, : Number of neighbors between query and reference is not equal
## to the number of neighbors within reference
## Warning: No assay specified, setting assay as RNA by default.
rm(reference)
gc()
##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  19316702 1031.7   33088385 1767.2  33088385 1767.2
## Vcells 361685983 2759.5  873084858 6661.2 873078453 6661.1
DT::datatable(table(data$seurat_clusters, data$predicted.compartment), caption = ("table of azimuth compartment annotation per seurat clusters"),               extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',
              buttons = c( 'csv', 'excel')))
DT::datatable(table(data$seurat_clusters, data$predicted.celltype), caption = ("table of azimuth annotation per seurat clusters"), 
              extensions = 'Buttons', 
              options = list(  dom = 'Bfrtip',
              buttons = c( 'csv', 'excel')))
d1 <- DimPlot(data, reduction = "umap", group.by = "seurat_clusters", label = TRUE) + NoLegend()
d2 <- DimPlot(data, reduction = "umap", dims = c(1,2), group.by = "predicted.compartment", label = TRUE, repel = TRUE) + NoLegend()
d3 <- DimPlot(data, reduction = "umap", dims = c(1,2), group.by = "predicted.celltype", label = TRUE, repel = TRUE) + NoLegend()

d1 | d2 | d3

Looking at the predicted compartment, we are satisfied that the immune, endothelium and stroma compartment are specific to a set of clusters. It is expected that the fetal__nephron compartment match to both blastema and epithelial cancer and normal kidney. It is thus OK to have it labeling two distinct set of clusters.

Looking at the predicted cell types, we observed that most of the clusters are labeled as cap mesenchyme, an early developmental stage of nephron. This suggest that these clusters are either blastema or primitive epithelium.

f1 <- SCpubr::do_BarPlot(sample = data,
                         group.by = "predicted.compartment",
                         split.by = "seurat_clusters",
                         position = "fill",
                         font.size = 10,
                         legend.ncol = 3) +
                         ggtitle("% cells")+
                         xlab(file)

f2 <- SCpubr::do_BarPlot(sample = data,
                         group.by = "predicted.celltype",
                         split.by = "seurat_clusters",
                         position = "fill",
                         font.size = 10,
                         legend.ncol = 3) +
                         ggtitle("% cells")+
                         xlab(file)

f1 | f2

Summary and conclusion

Taking together, we can conclude:

  • C13 = immune cells (with high confidency)

  • C10, is composed of stroma (normal or cancer) cells and endothelium cells. Might be a cluster of normal cells, a bit mixed. To be checked with inferCNV.

  • the two other set of clusters are cancer cells, with positivity for some Wilms tumor markers such as NCAM, WT1, EPCAM.

Additionally, we showed the feasibility to run label transfer using runAzimuth (part of step 3 of the proposed analysis: https://github.com/maud-p/OpenScPCA-analysis/issues/1).

We also showed the easy and robust identification of normal cells (endothelial and immune cells) that would serve as reference for the CNV inference (inferCNV, step 4 of the analysis).

Next step

[ ] We will adapt this template to be render on all the 40 Wilms tumor samples of the dataset.

[ ] We will save for each sample the rds file

[ ] We will run inferCNV for each sample to decide the malignant/normal status of some stroma and epithelial cluster, and confirm the blastema annotation.

The next step will provide us a better understanding of the entire cohort. We will then have to set up a strategy to annotate each sample. Open questions are:

[ ] should we annotate single cell
or
[] consider applying similar annotations to all cells in a cluster?

[ ] manual annotation of each cluster / each patient
or
[] automated annotation using some threshold?

Session Info

# record the versions of the packages used in this analysis and other environment information
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Vienna
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] data.table_1.15.4           clusterProfiler_4.12.2     
##  [3] msigdbr_7.5.1               Azimuth_0.5.0              
##  [5] shinyBS_0.61.1              DElegate_1.2.1             
##  [7] edgeR_4.2.1                 limma_3.60.4               
##  [9] viridis_0.6.5               viridisLite_0.4.2          
## [11] SCpubr_2.0.2                ggplotify_0.1.2            
## [13] patchwork_1.2.0             ggplot2_3.5.1              
## [15] DT_0.33                     dplyr_1.1.4                
## [17] tidyr_1.3.1                 Seurat_5.1.0               
## [19] SeuratObject_5.0.2          sp_2.1-4                   
## [21] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [23] Biobase_2.64.0              GenomicRanges_1.56.1       
## [25] GenomeInfoDb_1.40.1         IRanges_2.38.1             
## [27] S4Vectors_0.42.1            BiocGenerics_0.50.0        
## [29] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## 
## loaded via a namespace (and not attached):
##   [1] R.methodsS3_1.8.2                 poweRlaw_0.80.0                  
##   [3] goftest_1.2-3                     Biostrings_2.72.1                
##   [5] vctrs_0.6.5                       spatstat.random_3.3-1            
##   [7] digest_0.6.36                     png_0.1-8                        
##   [9] ggrepel_0.9.5                     deldir_2.0-4                     
##  [11] parallelly_1.38.0                 MASS_7.3-61                      
##  [13] Signac_1.13.0                     reshape2_1.4.4                   
##  [15] httpuv_1.6.15                     qvalue_2.36.0                    
##  [17] withr_3.0.1                       xfun_0.46                        
##  [19] ggfun_0.1.5                       survival_3.7-0                   
##  [21] EnsDb.Hsapiens.v86_2.99.0         memoise_2.0.1                    
##  [23] gson_0.1.0                        tidytree_0.4.6                   
##  [25] zoo_1.8-12                        gtools_3.9.5                     
##  [27] pbapply_1.7-2                     R.oo_1.26.0                      
##  [29] KEGGREST_1.44.1                   promises_1.3.0                   
##  [31] httr_1.4.7                        restfulr_0.0.15                  
##  [33] globals_0.16.3                    fitdistrplus_1.2-1               
##  [35] rhdf5filters_1.16.0               rhdf5_2.48.0                     
##  [37] rstudioapi_0.16.0                 UCSC.utils_1.0.0                 
##  [39] miniUI_0.1.1.1                    generics_0.1.3                   
##  [41] DOSE_3.30.2                       babelgene_22.9                   
##  [43] curl_5.2.1                        zlibbioc_1.50.0                  
##  [45] ggraph_2.2.1                      polyclip_1.10-7                  
##  [47] GenomeInfoDbData_1.2.12           SparseArray_1.4.8                
##  [49] xtable_1.8-4                      stringr_1.5.1                    
##  [51] pracma_2.4.4                      evaluate_0.24.0                  
##  [53] S4Arrays_1.4.1                    hms_1.1.3                        
##  [55] irlba_2.3.5.1                     colorspace_2.1-1                 
##  [57] hdf5r_1.3.11                      ROCR_1.0-11                      
##  [59] reticulate_1.38.0                 spatstat.data_3.1-2              
##  [61] magrittr_2.0.3                    lmtest_0.9-40                    
##  [63] readr_2.1.5                       glmGamPoi_1.16.0                 
##  [65] later_1.3.2                       ggtree_3.12.0                    
##  [67] lattice_0.22-6                    spatstat.geom_3.3-2              
##  [69] future.apply_1.11.2               shadowtext_0.1.4                 
##  [71] scattermore_1.2                   XML_3.99-0.17                    
##  [73] cowplot_1.1.3                     RcppAnnoy_0.0.22                 
##  [75] pillar_1.9.0                      nlme_3.1-165                     
##  [77] pwalign_1.0.0                     caTools_1.18.2                   
##  [79] compiler_4.4.1                    RSpectra_0.16-2                  
##  [81] stringi_1.8.4                     lubridate_1.9.3                  
##  [83] tensor_1.5                        GenomicAlignments_1.40.0         
##  [85] plyr_1.8.9                        crayon_1.5.3                     
##  [87] abind_1.4-5                       BiocIO_1.14.0                    
##  [89] gridGraphics_0.5-1                googledrive_2.1.1                
##  [91] locfit_1.5-9.10                   graphlayouts_1.1.1               
##  [93] bit_4.0.5                         fastmatch_1.1-4                  
##  [95] codetools_0.2-20                  crosstalk_1.2.1                  
##  [97] bslib_0.8.0                       SeuratData_0.2.2.9001            
##  [99] plotly_4.10.4                     mime_0.12                        
## [101] splines_4.4.1                     Rcpp_1.0.13                      
## [103] fastDummies_1.7.3                 sparseMatrixStats_1.16.0         
## [105] HDO.db_0.99.1                     cellranger_1.1.0                 
## [107] knitr_1.48                        blob_1.2.4                       
## [109] utf8_1.2.4                        seqLogo_1.70.0                   
## [111] AnnotationFilter_1.28.0           fs_1.6.4                         
## [113] listenv_0.9.1                     DelayedMatrixStats_1.26.0        
## [115] tibble_3.2.1                      Matrix_1.7-0                     
## [117] statmod_1.5.0                     tzdb_0.4.0                       
## [119] tweenr_2.0.3                      pkgconfig_2.0.3                  
## [121] tools_4.4.1                       cachem_1.1.0                     
## [123] RSQLite_2.3.7                     DBI_1.2.3                        
## [125] fastmap_1.2.0                     rmarkdown_2.27                   
## [127] scales_1.3.0                      grid_4.4.1                       
## [129] ica_1.0-3                         shinydashboard_0.7.2             
## [131] Rsamtools_2.20.0                  sass_0.4.9                       
## [133] dotCall64_1.1-1                   RANN_2.6.1                       
## [135] farver_2.1.2                      tidygraph_1.3.1                  
## [137] scatterpie_0.2.3                  yaml_2.3.10                      
## [139] rtracklayer_1.64.0                cli_3.6.3                        
## [141] purrr_1.0.2                       leiden_0.4.3.1                   
## [143] lifecycle_1.0.4                   uwot_0.2.2                       
## [145] presto_1.0.0                      BSgenome.Hsapiens.UCSC.hg38_1.4.5
## [147] BiocParallel_1.38.0               annotate_1.82.0                  
## [149] timechange_0.3.0                  gtable_0.3.5                     
## [151] rjson_0.2.21                      ggridges_0.5.6                   
## [153] progressr_0.14.0                  parallel_4.4.1                   
## [155] ape_5.8                           jsonlite_1.8.8                   
## [157] RcppHNSW_0.6.0                    TFBSTools_1.42.0                 
## [159] bitops_1.0-8                      assertthat_0.2.1                 
## [161] bit64_4.0.5                       Rtsne_0.17                       
## [163] yulab.utils_0.1.5                 spatstat.utils_3.0-5             
## [165] CNEr_1.40.0                       highr_0.11                       
## [167] jquerylib_0.1.4                   GOSemSim_2.30.0                  
## [169] shinyjs_2.1.0                     SeuratDisk_0.0.0.9021            
## [171] spatstat.univar_3.0-0             R.utils_2.12.3                   
## [173] lazyeval_0.2.2                    shiny_1.9.1                      
## [175] htmltools_0.5.8.1                 enrichplot_1.24.2                
## [177] GO.db_3.19.1                      sctransform_0.4.1                
## [179] rappdirs_0.3.3                    ensembldb_2.28.0                 
## [181] glue_1.7.0                        TFMPvalue_0.0.9                  
## [183] spam_2.10-0                       googlesheets4_1.1.1              
## [185] XVector_0.44.0                    RCurl_1.98-1.16                  
## [187] rprojroot_2.0.4                   treeio_1.28.0                    
## [189] BSgenome_1.72.0                   gridExtra_2.3                    
## [191] JASPAR2020_0.99.10                igraph_2.0.3                     
## [193] R6_2.5.1                          labeling_0.4.3                   
## [195] forcats_1.0.0                     RcppRoll_0.3.1                   
## [197] GenomicFeatures_1.56.0            cluster_2.1.6                    
## [199] Rhdf5lib_1.26.0                   gargle_1.5.2                     
## [201] aplot_0.2.3                       DirichletMultinomial_1.46.0      
## [203] DelayedArray_0.30.1               tidyselect_1.2.1                 
## [205] ProtGenerics_1.36.0               ggforce_0.4.2                    
## [207] AnnotationDbi_1.66.0              future_1.34.0                    
## [209] munsell_0.5.1                     KernSmooth_2.23-24               
## [211] htmlwidgets_1.6.4                 fgsea_1.30.0                     
## [213] RColorBrewer_1.1-3                rlang_1.1.4                      
## [215] spatstat.sparse_3.1-0             spatstat.explore_3.3-1           
## [217] fansi_1.0.6